TABLE 4.3
Search efficiency for different search strategies on ImageNet, including previous NAS
in both the real-valued and 1-bit search spaces, random search, and our DCP-NAS.

Method             T.P.   GGN   D.O.   Top-1 Acc. (%)   Search Cost (GPU days)
Real-valued NAS
  PNAS              -      -     -      74.2             225
  DARTS             -      -     -      73.1             4
  PC-DARTS          -      -     -      75.8             3.8
Direct BNAS
  BNAS1             -      -     -      64.3             2.6
  BNAS2-H           -      -     -      63.5             -
Random Search       -      -     -      51.3             4.4
Auxiliary BNAS
  CP-NAS            -      -     -      66.5             2.8
  DCP-NAS-L         ✓      -     -      71.4             27.9
  DCP-NAS-L         ✓      ✓     -      71.2             2.9
  DCP-NAS-L         ✓      -     ✓      72.6             27.9
  DCP-NAS-L         ✓      ✓     ✓      72.4             2.9

Note: T.P. and D.O. denote Tangent Propagation and Decoupled Optimization, respectively.
the tangent direction constraint and the reconstruction error can improve the accuracy on
ImageNet. When applied together, the Top-1 accuracy reaches the highest value of 72.4%.
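As a rough illustration of how these two terms can enter the search objective, the sketch below combines a task loss with a tangent-direction penalty and a reconstruction penalty in PyTorch-style pseudocode. The function name and tensor arguments are hypothetical placeholders, not the actual DCP-NAS implementation; lam and mu play the roles of the trade-off weights λ and μ examined next.

```python
import torch.nn.functional as F

def dcp_nas_style_objective(logits, targets,
                            grad_1bit, grad_real,
                            w_1bit, w_real,
                            lam, mu):
    """Hypothetical sketch of a combined DCP-NAS-style objective.

    task loss : cross-entropy of the 1-bit child model
    tangent   : penalizes disagreement between the 1-bit gradient and the
                gradient of its real-valued counterpart (tangent direction)
    recon     : reconstruction error between 1-bit weights and their
                real-valued counterparts
    lam, mu   : trade-off weights, cf. the sweep in Figure 4.15
    """
    task_loss = F.cross_entropy(logits, targets)
    tangent_loss = F.mse_loss(grad_1bit, grad_real)   # tangent direction constraint
    recon_loss = F.mse_loss(w_1bit, w_real)           # reconstruction error
    return task_loss + lam * tangent_loss + mu * recon_loss
```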
Then we conduct experiments with various values of λ and μ, as shown in Figure 4.15. We
observe that with a fixed value of μ, the Top-1 accuracy first increases with increasing λ
but decreases once λ exceeds 1e-3. As λ grows, DCP-NAS tends to select the binary
architecture whose gradient is most similar to that of its real-valued counterpart; to some
extent, the 1-bit model's own accuracy is neglected, leading to a performance drop. A similar
trend appears for μ: with λ fixed, the Top-1 accuracy first increases and then decreases as μ
grows. Placing too much weight on minimizing the distance between the 1-bit parameters and
their real-valued counterparts can collapse the representation ability of the 1-bit model and
severely degrade the performance of DCP-NAS.
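Before examining the search cost, it may help to recall what the GGN approximation does. For a composite loss l(f(w)), the exact Hessian with respect to w contains second derivatives of the network f, whereas the GGN keeps only the term J^T H_l J, built from the first-order Jacobian J of f and the small Hessian H_l of the loss with respect to the network output. The toy example below is only an illustration under that assumption, not the DCP-NAS implementation; it compares the GGN matrix against the exact Hessian for a tiny nonlinear map.

```python
import torch
from torch.autograd.functional import jacobian, hessian

def f(w):
    # Hypothetical tiny "network": R^3 -> R^2
    return torch.stack([torch.sin(w).sum(), (w ** 2).sum()])

def loss(z):
    # Convex loss on the network output
    return 0.5 * (z ** 2).sum()

w = torch.tensor([0.3, -1.2, 0.7])

J = jacobian(f, w)                           # (2, 3): df/dw, first-order only
H_l = hessian(loss, f(w))                    # (2, 2): d^2 l / dz^2, cheap to form
ggn = J.T @ H_l @ J                          # (3, 3) GGN approximation of the Hessian

exact = hessian(lambda w_: loss(f(w_)), w)   # exact (3, 3) Hessian for comparison
print("GGN:\n", ggn)
print("Exact Hessian:\n", exact)
```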
To better understand the acceleration obtained by applying the Generalized Gauss-Newton
(GGN) matrix in the search process, we conduct experiments to examine the search cost
with and without GGN. As shown in Table 4.3, we compare the search efficiency and the
accuracy of the architectures obtained by random search (random selection), real-valued
NAS methods, binarized NAS methods, CP-NAS, DCP-NAS without GGN, and DCP-NAS
with GGN. In random search, the 1-bit supernet randomly samples and trains an architecture
in each epoch, assigns the expected performance to each corresponding edge and operation,
and returns the architecture with the highest score; this strategy lacks the necessary guidance
during the search and therefore performs poorly for binary architecture search. Notably,
DCP-NAS without GGN is computationally expensive because of the second-order gradient
that must be computed in tangent propagation. Note that directly optimizing two supernets
is also computationally redundant. However, introducing GGN to approximate the Hessian
matrix significantly accelerates the search, reducing the search cost to roughly 10% of the
original with a negligible accuracy fluctuation. As shown in Table 4.3, with GGN our method
reduces the search cost from 27.9 to 2.9 GPU days, which is more efficient than DARTS.
Additionally, our DCP-NAS achieves a